This class relies heavily on the interactive online textbook,
CourseKata. To successfully run any code you learn in class, you need to
load the coursekata package first. Remember to run this
line of code at the top of any script you write in this class!
Otherwise, you will get an error message that says
could not find function....
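Loading the package is one line:

```r
# load the coursekata package at the top of every script
library(coursekata)
```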
We will work on a sample of the US population data,
mydata, from the Census. First, use read.csv()
to load the data from a CSV file.
After loading the dataset, you can inspect it by using functions like
head(), str(), glimpse(), or
summary().
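The loading and inspection steps look something like this (the CSV filename here is a placeholder; use whatever your data file is actually called):

```r
# load the data from a CSV file (filename is hypothetical)
mydata <- read.csv("census_sample.csv")

# inspect the data
head(mydata)    # first six rows
str(mydata)     # structure: variable names and types
summary(mydata) # summary statistics for each variable
```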
## age education gender native_country income
## 1 39 Bachelors Male United-States 26819
## 2 50 Bachelors Male United-States 28650
## 3 38 High School Male United-States 40974
## 4 53 No HS Diploma Male United-States 30426
## 5 28 Bachelors Female Cuba 30786
## 6 37 Masters Female United-States 42276
## 'data.frame': 32561 obs. of 5 variables:
## $ age : int 39 50 38 53 28 37 49 52 31 42 ...
## $ education : Factor w/ 5 levels "Bachelors","Doctorate",..: 1 1 3 5 1 4 5 3 4 1 ...
## $ gender : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 1 2 1 2 ...
## $ native_country: Factor w/ 42 levels "?","Cambodia",..: 40 40 40 40 6 40 24 40 40 40 ...
## $ income : chr "26819" "28650" "40974" "30426" ...
## age education gender native_country
## Min. :17.00 Bachelors : 5355 Female:10771 United-States:29170
## 1st Qu.:28.00 Doctorate : 413 Male :21790 Mexico : 643
## Median :37.00 High School :20241 ? : 583
## Mean :38.58 Masters : 2299 Philippines : 198
## 3rd Qu.:48.00 No HS Diploma: 4253 Germany : 137
## Max. :90.00 Canada : 121
## (Other) : 1709
## income
## Length:32561
## Class :character
## Mode :character
##
##
##
##
Using common sense, education, gender, and
native_country should be categorical variables, while
age and income should be quantitative. Yet
from str() we see that income is mistakenly
coded as character. Let’s convert it to a numeric variable. Recall
that, to refer to a variable in R, you use the $ sign:
data$variable.
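The conversion that produced the output below was presumably along these lines:

```r
# convert income from character to numeric
mydata$income <- as.numeric(mydata$income)
class(mydata$income) # check: should now be "numeric"
```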
## [1] "numeric"
In R, recode() is a quick way to recode a variable. For
example, we can recode the “?” values in native_country to
NA (R’s code for missing values).
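One possible way to do this with dplyr’s recode() (a sketch; NA_character_ is R’s typed missing value, required because the variable is a factor/character):

```r
# recode "?" values in native_country to NA (missing)
mydata$native_country <- recode(mydata$native_country, "?" = NA_character_)
```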
Let’s create a new variable to record the income level (categorical)
based on the income variable. There are several ways to do
this.
mydata$is_low_income <- mydata$income < 50000 # we create a new variable `is_low_income`, which is TRUE if `income` < 50000 and FALSE otherwise.
mydata$income_level <- ifelse(mydata$income > 50000, "high", "low") # we create a new variable `income_level`, which is "high" if `income` > 50000 and "low" otherwise.
mydata$income_group <- factor(ntile(mydata$income, 3), labels = c("low", "medium", "high")) # we create a new variable `income_group`, which divides people into 3 groups, based on the distribution of `income`: low, medium, and high.
# now let's check the new variables we created.
head(mydata, 10)
## age education gender native_country income is_low_income income_level
## 1 39 Bachelors Male United-States 26819 TRUE low
## 2 50 Bachelors Male United-States 28650 TRUE low
## 3 38 High School Male United-States 40974 TRUE low
## 4 53 No HS Diploma Male United-States 30426 TRUE low
## 5 28 Bachelors Female Cuba 30786 TRUE low
## 6 37 Masters Female United-States 42276 TRUE low
## 7 49 No HS Diploma Female Jamaica 32897 TRUE low
## 8 52 High School Male United-States 89878 FALSE high
## 9 31 Masters Female United-States 92414 FALSE high
## 10 42 Bachelors Male United-States 40153 TRUE low
## income_group
## 1 low
## 2 low
## 3 high
## 4 medium
## 5 medium
## 6 high
## 7 medium
## 8 high
## 9 high
## 10 high
Sometimes we need to filter the data to focus on a specific group.
For example, we can filter the data to exclude observations with missing
native_country.
mydata_noNA <- mydata %>% filter(native_country != "NA") # we use `filter()` to exclude observations with missing `native_country`.
mydata_noNA <- mydata %>% filter(!is.na(native_country)) # an easier way to do it
Similarly, we can filter the data to include only observations with
income > $50,000. The %>% sign is a pipe
operator that passes the result of the previous command to the next
command.
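For example (the name `mydata_high` is just illustrative):

```r
# keep only observations with income above $50,000
mydata_high <- mydata %>% filter(income > 50000)
head(mydata_high)
```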
## age education gender native_country income is_low_income income_level
## 1 52 High School Male United-States 89878 FALSE high
## 2 31 Masters Female United-States 92414 FALSE high
## 3 37 High School Male United-States 122256 FALSE high
## 4 30 Bachelors Male India 99868 FALSE high
## 5 40 High School Male <NA> 84153 FALSE high
## 6 43 Masters Female United-States 79537 FALSE high
## income_group
## 1 high
## 2 high
## 3 high
## 4 high
## 5 high
## 6 high
Sometimes we only need a subset of the data. We can use the
sample() function to randomly select a subset of the
data.
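A sketch using mosaic’s sample(), which can draw rows directly from a data frame (the seed and the name `mysample` are assumptions):

```r
set.seed(1)                     # for reproducibility
mysample <- sample(mydata, 100) # randomly draw 100 rows
nrow(mysample)                  # check the sample size
```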
## [1] 100
Before you examine the distribution of a variable, you need to know whether it is categorical or quantitative (numerical).
For a categorical variable (such as gender, color, political
affiliation), we use tally() to get a frequency table,
gf_bar to get a bar plot, or gf_percents to
get a percentage plot.
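For example, for education (the outputs below show a generic column label X, so the original code may have extracted the variable first):

```r
tally(~education, data = mydata)                     # frequency table
tally(~education, data = mydata, format = "percent") # percentage table
gf_bar(~education, data = mydata)                    # bar plot
```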
## X
## Bachelors Doctorate High School Masters No HS Diploma
## 5355 413 20241 2299 4253
## X
## Bachelors Doctorate High School Masters No HS Diploma
## 16.446055 1.268389 62.163324 7.060594 13.061638
For a numerical variable (such as age, GDP, income), we can use
favstats() to get its summary statistics, or create a
density plot, histogram, or boxplot to visualize its distribution.
What are the minimum, maximum, median, mean, IQR, and range of
income?
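favstats() answers all of these at once:

```r
favstats(~income, data = mydata) # min, Q1, median, Q3, max, mean, sd, n
# IQR = Q3 - Q1; range = max - min
```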
## min Q1 median Q3 max mean sd n missing
## 14761 27425 32462 43989 399502 45410.69 32767.73 32561 0
Now we can visualize the distribution of income.
gf_boxplot(~income, data = mydata) # creates a boxplot, which contains more details like median and range. The distribution is skewed right. Lots of people earn less than $50,000, and a small proportion of people earn much more.
If we are interested in the distribution of a numerical variable
(income) across different categories
(gender):
favstats(income ~ gender, data = mydata) # gives you the summary statistics of `income` across `gender` groups.
## gender min Q1 median Q3 max mean sd n missing
## 1 Female 15157 26747.00 30877 36626.00 332992 37383.89 23859.92 10771 0
## 2 Male 14761 27863.25 33594 58985.75 399502 49378.41 35714.31 21790 0
Visualizing the distribution of income across
gender:
gf_histogram(~income, data = mydata) %>% gf_facet_grid(gender ~ .) # a histogram of `income`, faceted by `gender`.
gf_histogram(~income, fill = ~gender, data = mydata) # a histogram of `income`, colored by `gender`.
gf_boxplot(income ~ gender, data = mydata) # boxplots of `income` by `gender`.
gf_jitter(income ~ gender, data = mydata) # a jitter plot of `income` by `gender`.
Note that in this class, we will not cover the reverse case (explaining a categorical variable with a numerical one).
If we are interested in the distribution of two categorical
variables, like gender and education:
tally(gender ~ education, data = mydata) # gives you the frequency of each combination of `gender` and `education`.
## education
## gender Bachelors Doctorate High School Masters No HS Diploma
## Female 1619 86 7117 628 1321
## Male 3736 327 13124 1671 2932
gf_bar(~education, data = mydata) %>% gf_facet_grid(. ~ gender) # a bar plot of `education`, faceted by `gender`.
gf_percents(~education, data = mydata) %>% gf_facet_grid(. ~ gender) # a percentage plot of `education`, faceted by `gender`.
The simplest model is the empty model, where we use the mean of a
quantitative variable to model its distribution. In a word form:
Income = Mean + Error. In a general linear model form:
\(\text{Income}_i = b_0 + e_i\). DO NOT
FORGET the error term \(e_i\)!!
# create a histogram of `income` and add a vertical line at the mean of `income`.
gf_histogram(~income, data = mydata) %>%
  gf_vline(xintercept = ~mean, data = favstats(~income, data = mydata), color = "darkred")
# create an empty model
empty_model <- lm(income ~ NULL, data = mydata)
# Visualize the empty model prediction
gf_histogram(~income, data = mydata) %>% gf_model(empty_model) # the same as the previous plot
# get the predictions from the empty model
mydata$empty_predicted <- predict(empty_model)
# check your model: it only has one intercept, which equals the mean of `income`.
empty_model
##
## Call:
## lm(formula = income ~ NULL, data = mydata)
##
## Coefficients:
## (Intercept)
## 45411
## [1] 45410.69
Let’s check some statistics to evaluate the model.
# data_i = b_0 (mean) + error_i (residual)
# to get the residuals
mydata$empty_residuals <- resid(empty_model)
mean(mydata$empty_residuals) # should be 0
## [1] -4.086115e-11
## [1] 742842665
## [1] 3.496045e+13
## [1] 1073723810
## [1] 1073723810
## [1] 32767.73
## [1] 32767.73
The simple linear model is a regression model with one predictor. We refer to X as the independent (explanatory) variable and Y as the dependent (response) variable.
First, let’s explain variation in income using age, a continuous
variable. In a word form: Income = Age + Error. In a
general linear model form: \(\text{Income}_i =
b_0 + b_1*\text{Age}_i + e_i\) (plug in the estimated
b0 and b1 we get).
age_model <- lm(income ~ age, data = mydata) # explain income using age
summary(age_model) # get the summary of the model
##
## Call:
## lm(formula = income ~ age, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50676 -17613 -10443 1180 356172
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27889.79 534.98 52.13 <2e-16 ***
## age 454.13 13.07 34.74 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32180 on 32559 degrees of freedom
## Multiple R-squared: 0.03574, Adjusted R-squared: 0.03571
## F-statistic: 1207 on 1 and 32559 DF, p-value: < 2.2e-16
## [1] 27889.79
## [1] 454.1253
# let's create a scatter plot of `income` and `age`, and add the regression line.
gf_point(income ~ age, data = mydata) %>% gf_model(age_model, color = "darkred")
What do b0 and b1 mean here? The
b0 is the intercept, which is the predicted value of
income when age is 0 (hypothetically
speaking). The b1 is the slope, which is the change in
income per one-unit change in age. Here it
means that, for each additional year of age, income increases by
about $454.13.
Second, let’s explain variation in income using gender, a categorical
variable. In a word form: Income = Gender + Error. In a
general linear model form: \(\text{Income}_i =
b_0 + b_1*\text{Gender}_i + e_i\).
gender_model <- lm(income ~ gender, data = mydata) # explain income using gender
summary(gender_model) # get the summary of the model
##
## Call:
## lm(formula = income ~ gender, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34617 -19007 -10845 1491 350124
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37383.9 311.0 120.20 <2e-16 ***
## genderMale 11994.5 380.2 31.55 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32280 on 32559 degrees of freedom
## Multiple R-squared: 0.02966, Adjusted R-squared: 0.02963
## F-statistic: 995.3 on 1 and 32559 DF, p-value: < 2.2e-16
# For a categorical variable, we have a reference level. Here, the reference level is Female. We can tell it from the summary output -- the slope is called `genderMale`.
b0(gender_model) # get the intercept
## [1] 37383.89
b1(gender_model) # get the slope
## [1] 11994.52
# let's create a histogram of `income` and `gender`, and add the regression line.
gf_histogram(~income, data = mydata) %>% gf_facet_grid(gender ~ .) %>%
gf_model(gender_model) %>% # gender model: mean income for female and male, separately.
gf_model(empty_model, color = "darkred") # empty model: mean income for all.
What do b0 and b1 mean here? The
b0 is the intercept, which is the mean income
for the reference level (here, Female). The b1 is the
slope, which is the difference in mean income between the
other level and the reference level (here, Male vs. Female). Here it means that
the average income of males is about $11,994.52 higher than that of females.
We have “Total = Model + Error” in the ANOVA table. The
Model is the variation in income explained by
the model, and the Error is the variation in
income not explained by the model.
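The table below comes from supernova(), applied to the age model:

```r
supernova(age_model) # ANOVA table: Model + Error = Total
```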
## Analysis of Variance Table (Type III SS)
## Model: income ~ age
##
## SS df MS F PRE p
## ----- --------------- | ------------ ----- ------------ -------- ----- -----
## Model (error reduced) | 1.249373e+12 1 1.249373e+12 1206.675 .0357 .0000
## Error (from model) | 3.371107e+13 32559 1.035384e+09
## ----- --------------- | ------------ ----- ------------ -------- ----- -----
## Total (empty model) | 3.496045e+13 32560 1.073724e+09
SS refers to the sum of squares. For example, the Sum of Squared Errors (SSE) is the SS value in the Error row, which is 3.371107e+13.
df refers to the degrees of freedom, which depends on the number of observations and variables.
MS refers to the mean square. MS is calculated by dividing SS by df: for example, MSE = SSE/df_Error.
Let’s compare the Age model and Gender model using PRE and F-ratio. Let’s also print out the ANOVA table for the gender model.
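The gender model’s table below uses supernova() again:

```r
supernova(gender_model)
```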
## Analysis of Variance Table (Type III SS)
## Model: income ~ gender
##
## SS df MS F PRE p
## ----- --------------- | ------------ ----- ------------ ------- ----- -----
## Model (error reduced) | 1.037006e+12 1 1.037006e+12 995.297 .0297 .0000
## Error (from model) | 3.392344e+13 32559 1.041907e+09
## ----- --------------- | ------------ ----- ------------ ------- ----- -----
## Total (empty model) | 3.496045e+13 32560 1.073724e+09
F refers to the F-ratio. It is calculated as
F = MS_Model / MS_Error. It represents how much variation
is explained by the model per degree of freedom (the variance explained
by the model is F-ratio times larger than the leftover
variance unexplained by the model). It can be higher than 1. We prefer
models with a higher F-ratio.
PRE refers to the Proportional Reduction in Error. It is calculated as
PRE = SS_Model / SS_Total. It measures how much variation
in Y is explained by X (from 0 to 1); in other words, how much error is
reduced. We prefer models with a higher PRE.
A sampling distribution is a distribution of sample
statistics, computed from randomly generated samples of the same size.
We can use shuffle() or resample() for the
random data generating process (DGP):
# shuffle income to simulate the empty model (no real relationship between income and age)
mydata$shuffled_income <- shuffle(mydata$income)
summary(lm(shuffled_income ~ age, data = mydata)) # the new age model
##
## Call:
## lm(formula = shuffled_income ~ age, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30499 -17977 -12941 -1437 353824
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44818.12 544.79 82.267 <2e-16 ***
## age 15.36 13.31 1.154 0.249
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32770 on 32559 degrees of freedom
## Multiple R-squared: 4.088e-05, Adjusted R-squared: 1.016e-05
## F-statistic: 1.331 on 1 and 32559 DF, p-value: 0.2486
# shuffle the age variable
sample_b1 <- b1(income ~ age, data = mydata)
sdob1 <- do(100) * b1(income ~ shuffle(age), data = mydata) # 100 b1 estimates from shuffled data
# alternatively: sdob1 <- do(100) * b1(income ~ age, data = resample(mydata))
# how often do we get a b1 value as or more extreme than the sample b1?
tally(sdob1$b1 > sample_b1 | sdob1$b1 < -sample_b1)
## X
## TRUE FALSE
## 0 100
Now, let’s look at the summary of the age model again. What do these values mean?
##
## Call:
## lm(formula = income ~ age, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50676 -17613 -10443 1180 356172
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27889.79 534.98 52.13 <2e-16 ***
## age 454.13 13.07 34.74 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32180 on 32559 degrees of freedom
## Multiple R-squared: 0.03574, Adjusted R-squared: 0.03571
## F-statistic: 1207 on 1 and 32559 DF, p-value: < 2.2e-16
Estimate: our b0 and b1 values, the estimated parameters.
Std. Error stands for the standard error. It is the standard deviation of the sampling distribution of the parameter estimate. We have SE = SD/sqrt(n).
t value is the standardized distance from 0 in the Student’s t distribution. We have t-value = Estimate / SE. The larger the t-value, the more extreme the estimated coefficient is in the sampling distribution of an empty model.
Pr(>|t|) refers to the p-value, which is the probability of getting a parameter estimate as extreme or more extreme than the sample estimate, given the assumption that the empty model is true (null hypothesis). We prefer a model with a lower p-value.
Confidence Interval (CI): We know that in the normal distribution, 95% of the distribution is within 1.96 standard deviations of the mean. Therefore, for example, the lower bound of b1 is b1 - 1.96*SE, and the upper bound of b1 is b1 + 1.96*SE. To get the CI of b0 and b1 in R, we can run:
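The three outputs below presumably correspond to code along these lines (the SE value 13.07 is taken from the summary() output above):

```r
confint(age_model)                     # 95% CI for b0 and b1 (the default)
b1(age_model) + c(-1.96, 1.96) * 13.07 # manual 95% CI for b1 using its SE
confint(age_model, level = 0.99)       # 99% CI
```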
## 2.5 % 97.5 %
## (Intercept) 26841.2112 28938.3641
## age 428.5014 479.7491
## [1] 428.5128 479.7472
## 0.5 % 99.5 %
## (Intercept) 26511.6951 29267.8803
## age 420.4491 487.8015
From the first output, we are 95% confident that the true value of
b1 is between 428.5 and 479.7. Since this interval does not include 0, we
can reject the null hypothesis at the 95% confidence level.
The multivariate model is a regression model with multiple
predictors. For example, it makes sense to explain variation in income
using both age and gender. In a word form:
Income = Age + Gender + Error. In a general linear model
form: \(\text{Income}_i = b_0 +
b_1*\text{Age}_i + b_2*\text{Gender}_i + e_i\).
# explain income using age and gender
age_gender_model <- lm(income ~ age + gender, data = mydata)
summary(age_gender_model)
##
## Call:
## lm(formula = income ~ age + gender, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -47908 -18283 -9735 3197 352409
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21877.91 567.30 38.56 <2e-16 ***
## age 420.69 12.96 32.47 <2e-16 ***
## genderMale 10911.11 375.68 29.04 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31770 on 32558 degrees of freedom
## Multiple R-squared: 0.06009, Adjusted R-squared: 0.06003
## F-statistic: 1041 on 2 and 32558 DF, p-value: < 2.2e-16
So, in a GLM form, the model is: \(\text{Income}_i = 21877.9 + 420.7*\text{Age}_i +
10911.1*\text{Gender}_i + e_i\). How should we interpret these
coefficients? \(b_1\) is the change in
income per one-unit change in age while
holding gender fixed, and \(b_2\) is
the difference in income between males and females (the
reference level) while holding age fixed.
## Analysis of Variance Table (Type III SS)
## Model: income ~ age + gender
##
## SS df MS F PRE p
## ------ --------------- | ------------ ----- ------------ -------- ----- -----
## Model (error reduced) | 2.100731e+12 2 1.050366e+12 1040.721 .0601 .0000
## age | 1.063725e+12 1 1.063725e+12 1053.958 .0314 .0000
## gender | 8.513584e+11 1 8.513584e+11 843.541 .0253 .0000
## Error (from model) | 3.285972e+13 32558 1.009267e+09
## ------ --------------- | ------------ ----- ------------ -------- ----- -----
## Total (empty model) | 3.496045e+13 32560 1.073724e+09
Compared to the age model, the age+gender model has a lower F-ratio but a higher PRE.
We can add an interaction term between age and
gender to the model.
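Using * instead of + includes both main effects and their interaction (the model name `interaction_model` is just illustrative):

```r
# income ~ age * gender expands to income ~ age + gender + age:gender
interaction_model <- lm(income ~ age * gender, data = mydata)
summary(interaction_model)
```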
##
## Call:
## lm(formula = income ~ age * gender, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54344 -17044 -9119 1926 353101
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30518.03 859.04 35.526 <2e-16 ***
## age 186.28 21.79 8.551 <2e-16 ***
## genderMale -2748.16 1088.45 -2.525 0.0116 *
## age:genderMale 361.70 27.06 13.366 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31680 on 32557 degrees of freedom
## Multiple R-squared: 0.06522, Adjusted R-squared: 0.06513
## F-statistic: 757.2 on 3 and 32557 DF, p-value: < 2.2e-16